ABSTRACT


Biased estimation in regression has experienced a tremendous growth
in popularity since Hoerl and Kennard's formalization of ridge regression.
Along with the interest in biased regression, many research efforts over
the last ten years have extended both the theory and application of these
methodologies. So too, criticisms have arisen which focus on the
incompleteness of the theoretical results and on exaggerated claims about
the merits of biased estimators. Rather than attempting to arbitrate these
opposing views, this article discusses biased estimation with special
emphasis on its ultimate justification: application to real problems.
Advantages and disadvantages of three biased estimators (principal
component, latent root, and ridge regression estimators) are discussed and
illustrated through a comprehensive analysis of a data set on automobile
emissions. From the discussion and analysis it is hoped that a more
balanced perspective on the application of biased estimation will be
fostered.


1.  INTRODUCTION 


One of the most important advances in statistical methodology over the
last ten years has been the introduction and refinement of biased regression
techniques. Biased regression is not altogether new; some procedures (e.g.,
principal component regression) were popular long before 1970. Nevertheless,
at the outset of the last decade least squares was the (almost) universally
accepted estimation scheme. The major impact of the newer biased regression
methodologies and the resurgence of some older ones has been their widespread
acceptance and utilization as alternatives to least squares by a very large
community of regression analysts in a variety of fields of application.

So widespread is the application of biased regression that the euphoria
surrounding its successful implementation has generated exaggerated claims
concerning the benefits that can be realized with biased estimators. Major
criticisms of biased regression are beginning to surface, criticisms which
are necessary for a proper perspective of the role of biased estimation in a
regression analysis. Some of the criticism, however, is based as much on
the exaggerated claims of users of the methodologies as on the true faults of
the techniques. Indeed, some of the current criticism is supported neither
by theoretical arguments nor by the views or intentions of the original
authors. Thus both supporters and critics of biased regression are guilty
of arguing from unsubstantiated viewpoints.

This article will review three of the more popular biased estimators
which have been introduced or refined over the last ten years: principal
component regression, latent root regression, and ridge regression. The
scope of the paper is intentionally limited to these estimators and to a
discussion of their strengths and weaknesses. Shrunken estimation, minimax
estimators, Bayesian regression, and robust regression are additional
estimation schemes which achieved major advances over the last decade but
the inclusion of which would render this discussion overly broad and
unwieldy. Throughout this article, then, the term "biased regression" will
refer generically to the three specific biased estimators mentioned above.

2.  BIASED  REGRESSION  AND  LEAST  SQUARES 

Biased regression estimators were not proposed as replacements for
least squares estimators. Draper and Van Nostrand (1979, p. 464) are correct
in insisting that "The extended inference that ridge regression is 'always'
better than least squares is, typically, completely unjustified." The same
type of criticism can be leveled at all regression estimators, including least
squares: no regression estimator has been shown to be universally superior
according to all reasonable criteria of evaluation.

Another point which must be understood when assessing the merits of
biased estimators is that they were not developed solely in order to produce
regression estimators that have better theoretical properties than least
squares estimators. To be sure, biased estimators which can be shown to have
decidedly inferior theoretical properties should be discarded, but the
"reasonable criteria" referred to above must include the judgement and insight
of the investigator. Hoerl (1962) proposed the ridge estimator as a means
of altering least squares estimates which were found to be unacceptable from
a chemical engineering standpoint almost a decade before Hoerl and Kennard
(1970a) buttressed the value of ridge regression with theoretical arguments.
This facet of the literature on biased regression is often overlooked
when attempts are made to assess biased methodologies. Many of the advances
in biased estimation were stimulated by the inability of least squares
to produce acceptable estimates on specific real data sets. Hoerl and Kennard
(1970a) begin their derivation of the ridge estimator by citing the occurrence
of such problems: "the least squares estimates often do not make sense when
put into the context of the physics, chemistry, and engineering of the
process which is generating the data." While it is essential that theoretical
properties of biased estimators be established to ensure they are not just
ad hoc replacements for least squares, a major motivation for the advocacy of
biased regression is the realization that properties of the data sometimes
preclude satisfactory estimation with least squares.

One property of the data which often results in unacceptable least
squares estimates is that of multicollinearity. The detrimental impact of
multicollinearities on least squares estimators is examined in Gunst and
Mason (1977a), Hoerl and Kennard (1970a), and Silvey (1959) and will only be
outlined here. Write the regression model as

$$Y = \beta_0 \mathbf{1} + X\beta + \varepsilon \qquad (2.1)$$

where $Y$ is an $(n \times 1)$ vector of observable response variables,
$\mathbf{1}$ is a vector of ones, $X$ is an $(n \times p)$ full-column-rank
matrix of standardized ($X'X$ is in correlation form) nonstochastic predictor
variables, $\varepsilon$ is a vector of unobservable error terms with
$\varepsilon_i \sim \mathrm{NID}(0, \sigma^2)$, and $\beta_0$ and $\beta$ are
the unknown constant term and vector of regression coefficients. The least
squares estimator of $\beta$ is

$$\hat\beta_{LS} = (X'X)^{-1} X'Y. \qquad (2.2)$$

Let the latent roots and latent vectors of $X'X$ be denoted by
$0 < \ell_1 \le \ell_2 \le \ldots \le \ell_p$ and $V_1, V_2, \ldots, V_p$,
respectively. Then

(i) $E[\hat\beta_{LS}' \hat\beta_{LS}] = \beta'\beta + \sigma^2 \sum_{j=1}^{p} \ell_j^{-1}$,

(ii) $\mathrm{Var}[V_j' \hat\beta_{LS}] = \sigma^2 / \ell_j$,

(iii) $\hat\beta_{LS} = \sum_{j=1}^{p} \ell_j^{-1} c_j V_j$, where $c_j = V_j' X'Y$.

Properties (i) to (iii) illustrate three commonly noticed difficulties
with least squares estimators. If multicollinearities exist in $X$, one or
more of the latent roots of $X'X$ will be close to zero. Then (i) asserts
that the length of the least squares estimator will tend to be much greater
than that of $\beta$ unless $\sigma^2$ is very small, (ii) reveals that some
linear combinations of $\beta$ will be estimated with much worse precision
than others, and (iii) indicates that the magnitudes (through large
$\ell_j^{-1}$) and signs (through the signs of the elements of $V_j$) of the
least squares estimates can be greatly affected by multicollinearities in the
data (see the above references for further discussion on these properties).
Of course, a particular data set might not exhibit detrimental properties
similar to these and least squares can yield acceptable results. The message
articulated above by Hoerl and Kennard (and others) is that, in their
experience, these deleterious effects are often observed when least squares
is utilized in conjunction with multicollinear predictor variables.
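Properties (ii) and (iii) can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using invented synthetic data (not the emissions data analyzed in this article); the variable names and the way the near-collinear column is built are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, invented for illustration
n = 44
h = rng.normal(size=n)
p = -0.8 * h + 0.6 * rng.normal(size=n)            # correlated with h
t = rng.normal(size=n)
raw = np.column_stack([h, p, t, 30.0 * h + h * p])  # last column nearly collinear with h

# Standardize the predictors so that X'X is in correlation form.
centered = raw - raw.mean(axis=0)
X = centered / np.sqrt((centered ** 2).sum(axis=0))

beta = np.array([-0.5, 0.3, 0.1, -0.5])             # hypothetical true coefficients
y = 1.0 + X @ beta + 0.05 * rng.normal(size=n)

# Latent roots (ascending order) and latent vectors (columns) of X'X.
roots, V = np.linalg.eigh(X.T @ X)

# Property (iii): beta_LS = sum_j c_j V_j / l_j, with c_j = V_j' X' y.
c = V.T @ (X.T @ y)
beta_ls_spectral = V @ (c / roots)
beta_ls_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Property (ii): Var[V_j' beta_LS] = sigma^2 / l_j, so 1 / l_j is the
# variance inflation carried by each latent vector.
variance_factors = 1.0 / roots

print("latent roots:      ", roots)
print("variance factors:  ", variance_factors)
print("spectral vs direct:", np.allclose(beta_ls_spectral, beta_ls_direct))
```

In this sketch the smallest latent root is close to zero because one column is almost a multiple of another, so its variance factor dwarfs the others; that single term is what dominates the least squares estimate.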

Gunst and Mason (1977b), Hoerl and Kennard (1970b), and Marquardt and
Snee (1975) analyze data sets which exhibit one or more of the preceding
difficulties and for which the resulting least squares estimates appear to
be inconsistent with intuitive or known physical characteristics of the
respective studies. For completeness and later reference, a further
illustration is now provided utilizing data contained in Hare and Bradow
(1977). In this study the investigators sought to obtain correction factors
in order to adjust diesel truck emissions for ambient humidity, atmospheric
pressure, and temperature levels. By using regression models, emissions
readings could be adjusted to "standard" humidity, pressure, and temperature
levels. The regression analysis examined here consists of 44 experiments
conducted on a single diesel truck. In each experiment temperature was
controlled but humidity and pressure conditions were not. Table 1 displays
three least squares fits of these variables to nitrous oxides (NOx) emitted
by the engine.

[Insert  Table  1] 

Prior to the analysis, the investigators were not able to specify the
precise form of the regression model but believed that increases in humidity
would be associated with substantial decreases in NOx and that changes in
temperature would be associated with relatively minor changes in NOx (Hare
and Bradow, p. 2571). In Table 1, three fits to the NOx emissions are shown.
The first fit includes only the linear terms for humidity (H), pressure (P),
and temperature (T), the second one adds interaction terms to the linear
ones, and the third one also includes the quadratic terms. The coefficients
of determination ($R^2$) indicate that the latter two fits offer substantial
improvement over the linear fit and that the full quadratic fit appears to be
a reasonable improvement over the fit with interaction terms. Disturbing,
however, are the changes in signs and magnitudes of the coefficient estimates
as terms are added to the linear fit.

In the linear fit, the coefficient on humidity is relatively large and
negative while the one on temperature is much smaller, both of which were
anticipated by the investigators. As the interaction and quadratic terms are
added the signs and magnitudes of the estimates fluctuate wildly, with those
for humidity and pressure increasing dramatically in magnitude and that for
temperature first increasing dramatically (in magnitude) and then dropping
just as rapidly. Note especially that in the quadratic fit the H×P
coefficient is the only negative one among the coefficients involving
humidity and that H, P, H×P, and $P^2$ have coefficients that are over two
orders of magnitude larger than any of those in the linear fit.

Table 1. Initial Least Squares Fits to NOx Emissions Data

                          Coefficient Estimates
Predictor                 Linear    Linear and     Quadratic    ℓ_1 = .517×10^-6    ℓ_2 = .196×10^-5
Variable                     Fit  Interactions           Fit                 V_1                 V_2

Humidity (H)               -.240         6.841        44.738               -.529                .061
Pressure (P)                .394        -2.090        69.736               -.448               -.313
Temperature (T)             .055       -29.215          .734                .125               -.642
P×T                                     28.090        -1.710               -.129                .631
H×P                                    -12.010       -46.019                .531               -.118
H×T                                      5.309          .797               -.017                .077
P^2                                                  -69.005                .453                .261
T^2                                                     .971                .009               -.010
H^2                                                     .458                .009               -.012

R^2                         .695          .795          .835

As one might surmise, the least squares estimates fluctuate wildly due
to correlations among the humidity and pressure readings. This fact was
unanticipated prior to the experiment but explained afterward as ascribable
to local environmental conditions: "decreases in atmospheric pressure are
frequently followed by southerly winds carrying humid air from the Gulf of
Mexico." Thus, while these conditions would not routinely be expected to
exist in other locales, during the experiment the correlation between H and P
was -0.816. Because of this correlation and because atmospheric pressure
did not vary greatly over the duration of the experiment, the correlation
between H and H×P was greater than 0.999, that between P and $P^2$ was also
greater than 0.999, and that between T and P×T was 0.996.

These three large pairwise correlations are also identifiable in the
latent vectors which correspond to the two smallest latent roots of $X'X$ for
the full quadratic fit. Displayed in Table 1, the two smallest latent roots
are very close to zero. The four largest magnitudes in $V_1$ correspond to H,
P, H×P, and $P^2$ and the two large ones in $V_2$ correspond to T and P×T.
Note too that in $V_1$ the elements for H and H×P are about equal in
magnitude and opposite in sign, those of P and $P^2$ are about equal in
magnitude and opposite in sign, and those of T and P×T in $V_2$ are about
equal in magnitude and opposite in sign. This occurrence is characteristic of
strong (positive) pairwise multicollinearities. Their effect on least squares
estimates is likewise characteristic.

Observe that the four least squares estimates with the largest
magnitudes in the quadratic fit correspond to the four large elements in
$V_1$. Note too that the coefficient estimates for H and H×P are about equal
in magnitude and opposite in sign, as are those of P and $P^2$ and to a
slightly lesser extent those of T and P×T. That the signs on the estimates
are not identical to those in $V_1$ and $V_2$ is unimportant since the latent
vectors are only unique up to a multiple of -1. It is apparent that the
magnitudes and signs of the estimates are being determined by the
multicollinearities among the predictor variables.

Two suggestions are often made concerning the possible rectification
of the problems just discovered. One is that more data should be collected,
data which do not exhibit these same multicollinearities. In the present
instance the suggestion might be impossible to accomplish. Even if
additional time and resources could be provided for continuation of the
experiment, the local climatic conditions will inevitably produce the same
correlations between humidity and pressure as well as the narrow range of
pressure readings. Perhaps all the equipment and personnel could be moved to
a new location but this would be at a major cost to the contractor.
Regardless of whether new experimentation is possible, one cannot merely
suggest that the only viable solution is to collect more data without regard
to the feasibility or ramifications of such a suggestion.

A second position that is often argued is that better use should be
made of a priori information in the estimator itself. Whenever the
information is of sufficient refinement that it can be so incorporated, we
wholeheartedly agree. A priori information is not always so refined, however,
and is often very crude. In this analysis one could employ estimators
which restrict the coefficient on humidity to be negative, consistent with
the investigators' a priori assessment of the effect of humidity on NOx. It
is more difficult and subjective to mathematically specify the relatively
lower importance of temperature. In addition, it is not always clear how
this information should generalize to models with interactions and quadratics.
Finally, the correlation between humidity and pressure, while explained
afterwards, was not anticipated prior to the experiment. Often such a
correlation is not explainable following an experiment but nevertheless
exists and must be accounted for in the regression estimator. A variety of
estimators can do so.

To summarize this section, biased estimators are not intended to
replace least squares. Least squares estimators have valuable theoretical and
empirical properties as well as a wealth of ancillary techniques associated
with them for variable selection, outlier detection, model assessment, etc.
Least squares will continue to be the single most important and popular
regression methodology. With multicollinear data, however, least squares
estimates are frequently at odds with known or suspected properties of the
model. The cause of the difficulties can usually be traced to the
multicollinearities themselves. Several remedial strategies are then
available to the regression analyst, one of which is biased estimation.


3.  BIASED  REGRESSION  ESTIMATORS 


Since the latent vectors of $X'X$ span p-space, the parameter vector
$\beta$ can always be expressed as

$$\beta = \sum_{j=1}^{p} \alpha_j V_j \qquad (3.1)$$

for appropriately chosen constants $\alpha_1, \alpha_2, \ldots, \alpha_p$.
Ultimately, then, estimation of $\beta$ consists of selecting suitable values
for the $\alpha_j$ in eqn. (3.1). In the last section it was noted that

$$\hat\beta_{LS} = (X'X)^{-1} X'Y = \sum_{j=1}^{p} \ell_j^{-1} c_j V_j, \qquad (3.2)$$

where $c_j = V_j' X'Y$. Because of the large values of $\ell_j^{-1}$ for
latent vectors identifying multicollinearities, least squares estimators
allow the first few terms in eqn. (3.2) to dominate the estimation of the
regression coefficients, at least for those coefficients corresponding to
multicollinear predictor variables. Note that the large values of
$\ell_j^{-1}$ are mandated by the strength of the multicollinearities and are
in no way influenced by the relative magnitudes of the $\alpha_j$ in
eqn. (3.1).

All three biased estimators discussed below dampen the least squares
weights on the $V_j$ in eqn. (3.2). In doing so, some of the deleterious
effects of multicollinearities on the least squares estimators can be
alleviated. In fact, all three biased estimators can be derived as least
squares estimators subject to restrictions on the effects of the
multicollinearities (see Hocking (1976)). In the following subsections, each
of the biased estimators is derived and its benefits outlined.


3.1  Principal  Component  Regression 

From eqn. (3.2), observe that $V_j' \hat\beta_{LS} = \ell_j^{-1} c_j$. One
technique for deriving the principal component estimator, $\hat\beta_{PC}$,
is to minimize the residual sum of squares $(Y - \hat{Y})'(Y - \hat{Y})$
subject to the restrictions $V_j'\beta = 0$ for some subset of the latent
vectors of $X'X$. Such an estimator completely eliminates the effect of the
latent vectors in the subset on the estimation of $\beta$. Thus,
$\hat\beta_{PC}$ is chosen to minimize

$$\phi = (Y - \beta_0 \mathbf{1} - X\beta)'(Y - \beta_0 \mathbf{1} - X\beta) + 2 \sum_{j \in D} \mu_j (V_j'\beta), \qquad (3.3)$$

where $D$ denotes the set of latent vectors which are to be deleted from
$\hat\beta_{PC}$ and the $\mu_j$ are Lagrangian multipliers. The resulting
estimator is

$$\hat\beta_{PC} = \sum_{j \in R} \ell_j^{-1} c_j V_j, \qquad (3.4)$$

where $R$ is the set of latent vectors that are not forced to be orthogonal
to $\beta$. The principal component estimator is simply the least squares
estimator (3.2) with a chosen set of terms eliminated. Other derivations and
motivations for the principal component estimator can be found in Kendall
(1957), Massy (1965), and Bock, Yancey, and Judge (1973).
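The description of the principal component estimator as the least squares expansion with selected terms removed translates directly into a few lines of code. The sketch below assumes NumPy and a standardized predictor matrix; the function name `pc_estimator`, the 0-based indexing of components, and the synthetic data are all hypothetical and serve only as illustration.

```python
import numpy as np

def pc_estimator(X, y, deleted):
    """Principal component estimate with the components in `deleted` removed.

    X is assumed standardized (X'X in correlation form). `deleted` holds
    0-based indices into the latent roots sorted in ascending order, so
    index 0 is the smallest latent root.
    """
    roots, V = np.linalg.eigh(X.T @ X)   # ascending roots, latent vectors in columns
    c = V.T @ (X.T @ y)                  # c_j = V_j' X' y
    dropped = set(deleted)
    keep = [j for j in range(roots.size) if j not in dropped]
    # Sum of c_j V_j / l_j over the retained components only.
    return V[:, keep] @ (c[keep] / roots[keep])

# Synthetic example (invented data): the third column is nearly a multiple of the first.
rng = np.random.default_rng(7)
n = 44
h = rng.normal(size=n)
p = rng.normal(size=n)
raw = np.column_stack([h, p, 30.0 * h + h * p])
centered = raw - raw.mean(axis=0)
X = centered / np.sqrt((centered ** 2).sum(axis=0))
y = 2.0 + X @ np.array([-0.4, 0.2, -0.4]) + 0.05 * rng.normal(size=n)

beta_ls = pc_estimator(X, y, deleted=[])    # nothing deleted: least squares
beta_pc = pc_estimator(X, y, deleted=[0])   # drop the smallest latent root

roots, V = np.linalg.eigh(X.T @ X)
print("least squares:      ", beta_ls)
print("principal component:", beta_pc)
print("V_1' beta_PC:       ", V[:, 0] @ beta_pc)   # zero up to rounding
```

Deleting a component forces the estimate to be orthogonal to that latent vector, which is the restriction imposed in the derivation; with no components deleted the function returns the ordinary least squares estimate.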

Two criteria are often cited for selecting the terms to eliminate from
the least squares estimator (3.2):

(i) delete terms with suitably small $\ell_j$

(ii) delete terms which do not sufficiently aid prediction of
the response.

The first criterion eliminates terms solely because they are associated with
multicollinearities. The second one attempts to assess whether the terms,
multicollinear or not, have predictive value, typically by utilizing
individual F statistics

$$F_j = \ell_j^{-1} c_j^2 / \hat\sigma^2 \qquad (3.5)$$

or a simultaneous F test

$$F_D = \sum_{j \in D} \ell_j^{-1} c_j^2 / (n_D \hat\sigma^2), \qquad (3.6)$$

where the summation in eqn. (3.6) includes all terms which are candidates for
deletion (i.e., all those in the set $D$) and $n_D$ is the number of
components deleted. The respective F statistics in eqns. (3.5) and (3.6) are
uniformly most powerful for the hypotheses they test, $V_j'\beta = 0$ and
$V_j'\beta = 0$ for all $j \in D$, respectively. If the simultaneous test
using $F_D$ is chosen, one frequently includes all terms which correspond to
multicollinearities that have been identified but it is not necessary to do
so or to exclude other terms.
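A minimal sketch of how such component-wise F statistics might be computed is given here. It assumes NumPy, a standardized predictor matrix, and, as an assumption not spelled out in the text, that the variance estimate is the least squares residual mean square on n - p - 1 degrees of freedom. Each F statistic is taken as the squared component estimate divided by its estimated variance, and the simultaneous statistic as the average of the individual ones over the deleted set. The function names and data are hypothetical.

```python
import numpy as np

def component_f_statistics(X, y):
    """Return latent roots, one F statistic per component, and sigma^2-hat."""
    n, p = X.shape
    roots, V = np.linalg.eigh(X.T @ X)           # ascending latent roots
    c = V.T @ (X.T @ y)                          # c_j = V_j' X' y
    beta_ls = V @ (c / roots)
    # Columns of X are centered, so the fitted intercept is the mean of y.
    residuals = (y - y.mean()) - X @ beta_ls
    sigma2_hat = residuals @ residuals / (n - p - 1)   # assumed variance estimate
    # (V_j' beta_LS)^2 / (sigma2_hat / l_j) = c_j^2 / (l_j * sigma2_hat)
    f_j = c ** 2 / (roots * sigma2_hat)
    return roots, f_j, sigma2_hat

def simultaneous_f(f_j, deleted):
    """Average of the individual statistics over the candidate set D."""
    return float(np.mean(f_j[list(deleted)]))

# Synthetic example (invented data, not the emissions data).
rng = np.random.default_rng(3)
n = 44
h = rng.normal(size=n)
p = rng.normal(size=n)
raw = np.column_stack([h, p, 30.0 * h + h * p])
centered = raw - raw.mean(axis=0)
X = centered / np.sqrt((centered ** 2).sum(axis=0))
y = 2.0 + X @ np.array([-0.4, 0.2, -0.4]) + 0.05 * rng.normal(size=n)

roots, f_j, sigma2_hat = component_f_statistics(X, y)
f_d = simultaneous_f(f_j, deleted=[0, 1])
print("latent roots:", roots)
print("F_j:         ", f_j)
print("F_D (two smallest roots):", f_d)
```

A useful consistency check on this construction is that averaging the statistics over all p components reproduces the usual overall regression F statistic, since the component sums of squares add up to the regression sum of squares.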

Bock, Yancey, and Judge (1973) studied mean squared error (risk)
properties of least squares and preliminary test estimators, of which the
principal component estimator using eqn. (3.6) is a special case. They showed
that the mean squared error of the principal component estimator is smaller
than that of least squares for small values of

$$\theta = \sum_{j \in D} \ell_j (V_j'\beta)^2 / (2 n_D \sigma^2), \qquad (3.7)$$

larger for moderate values of $\theta$, and about equal to that of least
squares for very large values of $\theta$. Observe that $\theta$ is the
noncentrality parameter of $F_D$ and large values of $\theta$ lead to the
retaining of multicollinear terms in $\hat\beta_{PC}$ through the rejection
of the hypothesis that they have no influence on the response.

Clear advantages accrue from the use of $\hat\beta_{PC}$ with either
criterion (i) or (ii) when one can ascertain that $\beta$ is orthogonal, or
nearly so, to the $V_j$ which identify multicollinearities in $X$ (i.e., when
$\theta$ is small). Either intuition, previous estimates with
nonmulticollinear data sets, or theoretical knowledge of the model could lead
one to conclude that the coefficient vector is orthogonal to one or more of
the multicollinearities. If so, the principal component estimator can be
effectively employed, perhaps with the choice of criterion (i) or (ii)
depending on one's degree of confidence in the orthogonality of $\beta$ to
the multicollinearities.

The degree of one's confidence in the orthogonality of $\beta$ to the
multicollinearities can also be reflected in the choice of significance
levels for the F statistics (3.5) and (3.6). The greater the certitude of
one's belief in the orthogonality of $\beta$ to the multicollinearities, the
smaller one should choose the significance level. In other words, the greater
one's certitude the stronger the empirical evidence should be to cause one to
retain the multicollinear terms. From another argument the same conclusion
can be drawn. Since the principal component estimator using either (i) or
(ii) as a criterion has smaller mean squared error than least squares for
small values of $\theta$, only large values of $\theta$ lead one to desire to
retain multicollinear terms. Thus, rejection of $H_0$ when small significance
levels are utilized provides greater assurance that the multicollinear terms
should be retained in the estimator regardless of one's possible misgivings
about the effects of the multicollinearities or one's belief about the
orthogonality of $\beta$ to the multicollinear $V_j$. If one fails to reject,
the multicollinear terms should be deleted.

While this strategy is appealing in several respects, additional
properties of the F statistics sometimes render inferences unreliable. Since
$\theta$ can be rewritten in terms of the $\alpha_j$ in eqn. (3.1) as

$$\theta = \sum_{j \in D} \alpha_j^2 \ell_j / (2 n_D \sigma^2), \qquad (3.8)$$

it is clear that small latent roots can greatly dampen the noncentrality
parameter of the F tests over that of an orthogonal $X$ matrix (all
$\ell_j = 1.0$). As an illustration of the effect of small latent roots on
the power of these tests, Figure 1 graphs power curves for a test of
$H_0: \alpha_j = 0$ vs $H_a: \alpha_j \ne 0$ for two choices of $\ell_j$. The
power of the test is greatly reduced when $\ell_j$ changes from 1.0 to .01,
and the test can be rendered virtually useless for smaller $\ell_j$ if
$\alpha_j^2$ is not very much larger than $\sigma^2$. The danger here is that
the F test will have poor power because of the multicollinearity (small
$\ell_j$) and not because $\alpha_j$ is zero or negligible. Nonrejection of
$H_0$ could cause the elimination of terms from $\hat\beta_{PC}$ that are
necessary for adequate estimation of $\beta$. One could compensate for this
problem by employing large significance levels but the variance reduction
properties of $\hat\beta_{PC}$ over $\hat\beta_{LS}$ and the discussions of
the previous paragraphs generally dictate that small significance levels be
used.
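The loss of power caused by a small latent root can be illustrated with a simplified calculation. The sketch below is not the F test graphed in Figure 1: it assumes the error variance is known, so that the estimate of a single component is normally distributed about $\alpha_j$ with variance $\sigma^2/\ell_j$, and it uses a two-sided 5% normal test in place of the F test. The function names and the particular values of $\alpha_j$ are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_sided(alpha_j, root, sigma=1.0, z_crit=1.959964):
    """Approximate power of a 5% two-sided test of H0: alpha_j = 0.

    Assumes sigma is known, so the component estimate is
    N(alpha_j, sigma^2 / root) and the standardized shift under the
    alternative is alpha_j * sqrt(root) / sigma.
    """
    delta = abs(alpha_j) * math.sqrt(root) / sigma
    return (1.0 - normal_cdf(z_crit - delta)) + normal_cdf(-z_crit - delta)

for root in (1.0, 0.01):
    for alpha_j in (1.0, 2.0, 3.0, 5.0):
        power = power_two_sided(alpha_j, root)
        print(f"latent root {root:5.2f}, alpha_j/sigma {alpha_j:3.1f}: power {power:.3f}")
```

Under these simplifying assumptions the standardized shift is multiplied by the square root of the latent root, so moving from a root of 1.0 to .01 cuts the shift by a factor of ten and leaves the test with little ability to detect a nonzero $\alpha_j$ of moderate size.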


Consider again the summary information in Table 1. Recall that the
pairwise correlation coefficient between P and $P^2$ is greater than 0.999.
With such a large positive correlation, one might expect that the true
coefficients should be about equal in magnitude and of the same sign. One
very important feature of the pressure readings that was mentioned earlier is
that they are measured over a narrow range. The coefficient of variation,
$S/\bar{X}$, for the pressure readings is 0.216/29.218 = 0.007. This explains
the large positive pairwise correlations between H and H×P (greater than
0.999) and between T and P×T (0.996). One might again, therefore, expect that
the true coefficients for the respective pairs of predictor variables (H and
H×P, T and P×T) should be of the same sign and about equal in magnitude. Thus
the true coefficient vector $\beta$ should be orthogonal to both $V_1$ and
$V_2$, a property not reflected in the least squares estimates because of the
great influence of $V_1$ and $V_2$ on $\hat\beta_{LS}$.

In contrast to the least squares estimates are the principal component
estimates, displayed in Table 2. First consider whether to delete the two
latent vectors discussed above. The arguments posed in the last paragraph
could lead one to eliminate $V_1$ and $V_2$ without performing a preliminary
test. If one were uncertain as to the correctness of those arguments, the
summary statistics given in Table 2 for $F_1$, $F_2$, and $F_D$ (deleting
components 1, 2) all suggest the same result: delete both. Now observe that
when $V_1$ is deleted, the signs on H, H×P, and H×T are all reversed, with H
and H×T becoming negative. The magnitudes of many of the estimates are
reduced as well. This trend continues when $V_2$ is eliminated. It is
reassuring to observe that when $V_1$ and $V_2$ are deleted humidity has the
largest coefficient and its sign is negative, but the relative magnitudes of
temperature, P×T, H×T, and $P^2$ as well as the signs on H×P and $P^2$ are
still disturbing (H×P should have the same sign as H, and $P^2$ should have
the same sign as P). It is now time to examine the remaining latent roots and
latent vectors of $X'X$.

[Insert  Table  2] 

Table 3 contains four additional latent roots and the corresponding
latent vectors of $X'X$. Carefully examining $V_3$, one observes that the
elements for H and $P^2$ are about equal in magnitude and of the same sign
(reflecting the negative correlation between H and P and the equivalence of P
and $P^2$), those of P and H×P are about equal in magnitude and of the same
sign (again indicating the negative correlation between P and H since H is
highly correlated with H×P), and the elements corresponding to T and P×T
are about equal in magnitude and opposite in sign (showing the correlation
between T and P×T). One can again argue that $\beta$ is orthogonal to $V_3$
or use F tests (see Table 2) to conclude that $V_3$ should be deleted. When
$V_1$, $V_2$, and $V_3$ are deleted from $\hat\beta_{PC}$, the signs and
magnitudes of the estimated coefficients appear to be consistent with the
investigators' a priori knowledge, except perhaps that the magnitude of
temperature is too large. Note, however, the large drops in magnitudes when
$V_3$ is deleted.

[Insert Table 3]

The  remaining  three  latent  vectors  displayed  in  Table  3  can  be  assessed 

similarly  to  the  above  three.  One  of  these  vectors,  however,  deserves  special 

note.  The  sixth  latent  vector  corresponds  to  a  latent  root  which  is  five 

orders  of  magnitude  larger  than  the  smallest  one.  Its  elimination  will  not 

result  in  as  substantial  a  reduction  in  variance  as  any  of  the  previous  la- 

2 

tent  vectors.  The  large  elements  in  Vg  correspond  to  H,  Hxp,  and  H  ,  all 
of  which  involve  humidity.  Since  humidity  is  believed  to  be  an  in$>ortant 
predictor  of  N0x  emissions,  deletion  of  this  latent  vector  could  result  in 
Table 2.  Principal Component Estimates for NOx Emissions Data, Quadratic Fit

Predictor                               Components Deleted
Variable             None        V1     V1,V2  V1 to V3  V1 to V4  V1 to V5  V1 to V6

Humidity (H)       44.738   -13.547   -11.520    -1.178    -1.372     -.561     -.060
Pressure (P)       69.736    20.371     9.967      .175      .136      .068      .224
Temperature (T)      .734    14.551    -6.754     -.887     -.248     -.006      .007
P×T                -1.710   -15.973     4.962    -1.082     -.530     -.007      .027
H×P               -46.019    12.525     8.598    -1.165    -1.358     -.600     -.059
H×T                  .797    -1.059     1.506     1.337     1.744     -.296     -.048
P^2               -69.005   -19.036   -10.374      .101      .081      .083      .224
T^2                  .971     1.924     1.581     1.878      .627      .077      .007
H^2                  .458     1.408     1.008      .701      .706     1.047     -.006

R^2                  .835      .823      .818      .808      .805      .775      .690
\hat{\sigma}^2     .00241    .00252    .00251    .00258    .00256    .00287    .00385

Component Deleted                  1           2           3           4           5           6
l_j                        .517x10^-6  .196x10^-5  .109x10^-4  .647x10^-3  .239x10^-2  .239x10^-1
F_j                            2.612        .897       2.168        .677       6.108      17.556
Significance Probability   .10<p<.25       .25<p   .10<p<.25       .25<p  .01<p<.025      p<.001

Components Deleted               1,2       1,2,3     1,2,3,4   1,2,3,4,5  1,2,3,4,5,6
F_d                            1.754       1.892       1.588       2.492       5.003
Significance Probability   .10<p<.25   .10<p<.25   .10<p<.25   .10<p<.25      p<.001
Table 3.  Additional Latent Vectors of X'X, Quadratic Model

Predictor          l_3 = .109x10^-4  l_4 = .647x10^-3  l_5 = .239x10^-2  l_6 = .239x10^-1
Variable                 V3                V4                V5                V6

Humidity (H)           -.473              .123             -.327             -.378
Pressure (P)            .447              .024              .027             -.117
Temperature (T)        -.268             -.402             -.098             -.010
P×T                     .276             -.347             -.211             -.025
H×P                     .446              .121             -.305             -.407
H×T                     .008             -.257              .823             -.187
P^2                    -.479              .013             -.001             -.107
T^2                    -.014              .788              .221              .053
H^2                     .014             -.003             -.138              .793


large bias. Coupled with the relatively smaller reduction in variance from
deleting $V_6$ relative to the other latent vectors, the potential for inducing
bias suggests that it should be examined carefully before allowing its
removal. The F statistic in Table 2 confirms that this component is significant
at extremely small significance levels and should be retained.
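
The variance side of this argument can be made concrete. For standardized
predictors the total variance of the least squares coefficients is
$\sigma^2 \sum_j l_j^{-1}$, so deleting component j removes $\sigma^2 / l_j$. The
sketch below assumes only the six latent roots listed in Table 2; the names
`roots` and `variance_contribution` are illustrative and not part of the
original analysis.

```python
# Six smallest latent roots of X'X as listed in Table 2 (components 1 to 6).
roots = [0.517e-6, 0.196e-5, 0.109e-4, 0.647e-3, 0.239e-2, 0.239e-1]


def variance_contribution(root):
    """Variance removed by deleting one component, in units of sigma^2."""
    return 1.0 / root


contributions = [variance_contribution(l) for l in roots]

# Print each contribution and its size relative to that of the first component.
for j, c in enumerate(contributions, start=1):
    print(f"V{j}: 1/l = {c:12.1f}   ratio to V1 = {c / contributions[0]:.2e}")
```

The contribution of the sixth component is smaller than that of the first by
a factor of roughly 46,000, which is why its deletion buys comparatively
little variance reduction.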

Based on these considerations, the principal component estimates
ultimately selected consisted of the elimination of $V_1$ through $V_5$. The
standardized estimates shown in column seven of Table 2 have relatively large
magnitudes on all the humidity terms, although the magnitudes are greatly
reduced from the least squares quadratic fit and are of the same relative size
as those in the linear fit. The signs on the humidity terms are also consistent
with the investigators' belief that humidity should have a negative effect on
NOx. The other variables, perhaps with the exception of T and P×T, are seen
to have a lesser but nonnegligible effect on NOx.

From  the  detailed  nature  of  the  foregoing  analysis,  it  should  be  apparent 
that  biased  estimation  is  neither  routine  nor  automatic.  The  desire  of  some 
data  analysts  for  an  automated  analysis  is  both  a  source  of  criticism  of  biased 
estimation  and  an  invitation  to  poor  results.  On  the  other  hand,  careful  and 
detailed  examination  of  the  data  base,  and  in  particular  multicollinearities, 
leads  to  a  better  understanding  of  the  resulting  estimates  and  more  fruitful, 
consistent  conclusions. 

3.2  Latent  Root  Regression 

Latent  root  regression  was  spawned  by  a  desire  to  more  effectively  assess 
multicollinearities.  Rather  than  examining  only  the  latent  roots  and  latent 
vectors  of  X'X,  latent  root  regression  also  investigates  the  latent  roots  and 
latent  vectors  of  A'A,  the  (p  +1)  x  (p  +  1)  matrix  of  correlations  of  response 
and predictor variables; i.e., let A = [Y* : X], where Y* has elements
$\eta^{-1}(Y_i - \bar{Y})$ and $\eta^2 = \sum (Y_i - \bar{Y})^2$. Multicollinearities are
first identified by studying the pairwise correlations in X'X, the variance
inflation factors (Marquardt (1970), Marquardt and Snee (1975)), and the
latent roots and latent vectors of X'X. Multicollinearities are then assessed
by analyzing the latent roots and latent vectors of A'A.

Let $0 \le \lambda_0 \le \lambda_1 \le \dots \le \lambda_p$ denote the latent roots of A'A and
$\gamma_0, \gamma_1, \dots, \gamma_p$ the corresponding latent vectors. Partition $\gamma_j$ as
$\gamma_j' = (\gamma_{0j}, \delta_j')$ so that $\delta_j$ contains the last p elements of $\gamma_j$. If
$\lambda_j$ is sufficiently close to zero, $\gamma_j$ identifies a multicollinearity in A
just as $l_j \approx 0$ points out a multicollinearity of X that is defined by $V_j$.
If, in addition to $\lambda_j$ sufficiently close to zero, the magnitude of
$\gamma_{0j}$ is also close to zero, $\gamma_j$ identifies a multicollinearity which
involves only the predictor variables and not the response, a "nonpredictive"
multicollinearity (see Webster, Gunst, and Mason (1974) and White and Gunst
(1979)). Latent root regression eliminates nonpredictive multicollinearities
from the resulting estimator.

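In terms of the $\delta_j$, the latent root estimator is given by

    $\hat{\beta}_{LR} = \sum_{j \in R} f_j \delta_j$,  where  $f_j = -\eta \gamma_{0j} \lambda_j^{-1} / \sum_{i \in R} \gamma_{0i}^2 \lambda_i^{-1}$.    (3.9)
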

In eqn. (3.9), as in the formula for $\hat{\beta}_{PC}$, R refers to the set of indices
of latent vectors that are retained in the latent root estimator; i.e., all
latent vectors except those which identify nonpredictive multicollinearities.
The latent root estimator minimizes


    $\phi = (Y - \beta_0 \mathbf{1} - X\beta)'(Y - \beta_0 \mathbf{1} - X\beta) + 2 \sum_{j \notin R} \mu_j (\delta_j'\beta - \eta \gamma_{0j})$,    (3.10)


which is similar in form to eqn. (3.3), especially since $\gamma_{0j}$ is close to
zero for nonpredictive multicollinearities. If the multicollinearities of A'A
are different from those of X'X, however, the latent root estimator can be
quite different from both the least squares and the principal component
estimators.
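
Eqn. (3.9) can be sanity-checked in the simplest possible setting. With a
single standardized predictor whose correlation with the response is r, A'A
is the 2 x 2 matrix [[1, r], [r, 1]], whose latent roots (1 - r and 1 + r)
and latent vectors are available in closed form. The sketch below evaluates
(3.9) with both latent vectors retained and recovers the least squares slope,
$\eta r$. The function name and the one-predictor setting are illustrative
assumptions and are not part of the emissions analysis.

```python
import math


def latent_root_slope(r, eta):
    """Evaluate eqn. (3.9) for one standardized predictor, retaining both vectors.

    r   : correlation between the response and the predictor (|r| < 1)
    eta : square root of the corrected sum of squares of the response
    """
    s = 1.0 / math.sqrt(2.0)
    # Each entry is (latent root lambda_j, first element gamma_0j, remainder delta_j)
    # for the 2 x 2 matrix A'A = [[1, r], [r, 1]].
    parts = [(1.0 - r, s, -s), (1.0 + r, s, s)]
    denom = sum(g0 * g0 / lam for lam, g0, _ in parts)
    # f_j = -eta * gamma_0j / lambda_j / denom, and the estimator is sum_j f_j * delta_j.
    return sum(-eta * g0 / lam / denom * delta for lam, g0, delta in parts)


print(latent_root_slope(0.6, 2.0))   # least squares slope eta * r
```

With every latent vector retained the estimator coincides with least squares;
the biased estimator arises only once the nonpredictive vectors are dropped
from R.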

As expressed in eqn. (3.9) the latent root estimator is a complicated
function of the response variable and its distributional properties are
currently unknown. Apart from the intuitive appeal of being able to assess
the predictiveness of multicollinearities and adjust for nonpredictive ones,
the efficacy of $\hat{\beta}_{LR}$ has been argued primarily from simulation studies (see
Gunst, Webster and Mason (1976) and Gunst and Mason (1977a)), as is the case
for many other biased estimators. Recently, White and Gunst (1979) developed
asymptotic inference procedures which can be utilized to refine the
assessment of multicollinearities when sample sizes are large.

One facet of latent root regression suggests that it is a valuable
alternative to principal component regression. Not only can multicollinearities
in X'X be assessed for their predictive value, occasionally new
multicollinearities appear in A'A which are different from those of X'X.
Necessarily these multicollinearities involve the response variable and
generally have predictive value. They can take on at least two forms:
"crossovers" and "distortions."

A crossover occurs if one of the latent vectors of X'X, say $V_2$,
displaces another one, say $V_1$, in A'A when $l_1 < l_2$. For example, if $V_1$
corresponds to the smallest latent root of X'X, $l_1$, and $V_2$ corresponds to
the next smallest one, $l_2$, yet in A'A $\delta_0 \approx V_2$ and $\delta_1 \approx V_1$, a crossover
is said to have occurred ("crossover" since now $V_2 \approx \delta_0$ corresponds to the
smallest latent root of A'A and $V_1 \approx \delta_1$ to the second smallest). Regardless
of the apparent magnitude of $\gamma_{00}$, the elements of $\gamma_0$ identify a predictive
multicollinearity. This is because if $\lambda_0 < l_1$ and $\gamma_{00}$ is actually zero,

    $\lambda_0 = \gamma_0' A'A \gamma_0 = \delta_0' X'X \delta_0 < V_1' X'X V_1 = l_1$.

But this is a contradiction since $V_1$ is the unique unit length vector which
minimizes the quadratic form u'X'Xu. The reason for the caution about the
apparent magnitude of $\gamma_{00}$ is that if $\beta$ is a linear combination of only
multicollinear latent vectors of X'X (those associated with small latent
roots), the first element of $\gamma_0$ can be small even if $\gamma_0$ identifies a
predictive multicollinearity (see White and Gunst (1979)).

Similarly, if $\beta$ is a linear combination of some multicollinear latent
vectors of X'X and some nonmulticollinear ones, the multicollinearities in
A'A can be distortions of those of X'X. For example, suppose both $\alpha_1$ and
$\alpha_p$ are large in eqn. (3.1) and all other $\alpha_j$ are zero. If $\sigma^2$ is
sufficiently small, $\delta_0$ will be approximately proportional to
$\alpha_1 V_1 + \alpha_p V_p$ and none of the other $\delta_j$ will be approximately equal to
$V_1$. Thus $V_1$ is "distorted" in the latent vectors of A'A although the other
multicollinear latent vectors of X'X might not be.

The value of this information is that one need not restrict inferences
on multicollinearities to the individual latent vectors of X'X or to a linear
combination of only the multicollinear ones, as in eqns. (3.5) and (3.6). If
$\beta$ is actually equal to one of the $V_j$ or a linear combination of only the
multicollinear ones, analysis of predictive multicollinearities using the
latent roots and latent vectors of A'A cannot have greater statistical power
than the appropriate F statistics. However, an analysis of the latent roots
and vectors of A'A can potentially detect whether $\beta$ only partially involves
the latent vectors associated with multicollinearities.


Turning now to an analysis of the emissions data, Table 4 lists the
six smallest latent roots of A'A and their associated latent vectors.
Comparing these latent roots and latent vectors with those of X'X, neither
crossovers nor distortions are apparent. All six latent vectors are
associated with small latent roots but the first element of $\gamma_5$ is clearly
not close enough to zero to be judged a nonpredictive multicollinearity. This
finding supports the conclusions drawn above on $V_6$ from the principal
component analysis and is again due to the large weight on $H^2$ in the latent
vector as well as the moderately large weights on H, H×P, and H×T.

[Insert  Table  4] 

The very small values of $\gamma_{0j}$ for the first four latent vectors in Table
4 and the discussions in the previous section of the multicollinearities
lead immediately to the elimination of these latent vectors from $\hat{\beta}_{LR}$.

Whether $\gamma_4$ should also be eliminated is questionable. Its first element
is larger than any of the previous four but the decision of whether it is
sufficiently close to zero to be labeled nonpredictive is unclear. Since
the three elements of $\gamma_4$ that have large weights all involve humidity, one
might prefer to leave this vector in the estimator. On the other hand, one
could argue as follows that the multicollinearity is theoretically
nonpredictive. As with $\gamma_3$ and $\gamma_5$ (as well as $V_4$, $V_5$, and $V_6$), $\gamma_4$ is a
three-variable multicollinearity which is actually reflecting two
two-variable ones. The correlation between H and H×T is .992 and that between
H×P and H×T is .992, so that $X_1 \approx X_6$ and $X_5 \approx X_6$. Implied by these
multicollinearities is that $X_1 + X_5 \approx 2X_6$ or $2X_6 - X_1 - X_5 \approx 0$. Observe
that the magnitude of the element of $\gamma_4$ corresponding to H×T is roughly
twice the size and opposite in sign of that of H and H×P, just as the
relationship $2X_6 - X_1 - X_5$ indicates it should be. Now due to the large
pairwise correlations mentioned above, one might expect that $\beta_6 \approx \beta_1$ and
$\beta_6 \approx \beta_5$ so that $\beta'\delta_4 \approx 0$. Thus one could argue for the elimination of
$\gamma_4$ as well as $\gamma_0$ through $\gamma_3$.

Table 4.  Six Latent Roots and Latent Vectors of A'A, Quadratic Fit


Table 5 displays the latent root estimates of $\beta$ as the first six latent
vectors are sequentially deleted from $\hat{\beta}_{LR}$. In the fifth set of estimates,
$\gamma_0$ through $\gamma_4$ have been removed. All four humidity predictor variables have
large coefficient estimates with $\hat{\beta}_6 \approx \hat{\beta}_1$ and $\hat{\beta}_6 \approx \hat{\beta}_5$. The
estimates are not inconsistent with the investigators' a priori knowledge nor
are they substantially different from the principal component estimates. This
is not surprising since neither crossovers nor distortions occurred in this
data set among the latent vectors of X'X and A'A that identify
multicollinearities.

[Insert  Table  5] 

A major defect of the latent root estimator is the lack of exact
distributional theory. One must rely on a careful examination of the data base
and diagnostic statistics that are associated with the methodology in order
to draw adequate inferences with latent root regression. Even if the
asymptotic theoretical properties can be invoked to assist in drawing
inferences, a careful examination of X'X, its latent roots and vectors, the
latent roots and vectors of A'A, etc. is indispensable and cannot be
automated. Compensating for the added effort, however, is a better
comprehension of the data base and its limitations and a far more informed
understanding of the regression estimates, including their signs and
magnitudes.

3.3  Ridge  Regression 

Perhaps the single most influential catalyst to the current widespread
popularity of biased regression estimation is Hoerl and Kennard's (1970a, b)
development of ridge regression. By simply adding a small positive quantity
k to the diagonal elements of X'X, an entire family of estimators can be
generated:
Table 5.  Latent Root Estimates for NOx Emissions Data, Quadratic Fit

Predictor                               Components Deleted
Variable             None        γ0     γ0,γ1  γ0 to γ2  γ0 to γ3  γ0 to γ4  γ0 to γ5

Humidity (H)       44.738   -14.127   -11.573    -1.167    -1.373     -.509     -.003
Pressure (P)       69.736    19.504     9.604      .157      .119      .063      .237
Temperature (T)      .734    14.186    -6.935     -.883     -.195      .017      .013
P×T                -1.710   -15.624     5.136    -1.085     -.487      .030      .036
H×P               -46.019    13.060     8.632    -1.175    -1.374     -.553      .002
H×T                  .797    -1.015     1.532     1.337     1.761     -.401     -.042
P^2               -69.005   -18.204   -10.026      .120      .093      .088      .237
T^2                  .971     1.927     1.581     1.877      .527      .028     -.008
H^2                  .458     1.407     1.003      .701      .708     1.049     -.111

R^2                  .835      .822      .818      .808      .804      .771      .671
\hat{\sigma}^2     .00241    .00252    .00251    .00258    .00256    .00291    .00409

    $\hat{\beta}_{RR} = (X'X + kI)^{-1} X'Y = \sum_{j=1}^{p} (l_j + k)^{-1} c_j V_j$.    (3.11)

These estimators include least squares (k = 0) and the null vector
(k = $\infty$) as extremes. From eqn. (3.11) one can see that, unlike principal
component and latent root regression analyses, the ridge estimator deletes
none of the latent vectors of X'X from the analysis but decreases the weights
(from least squares) on each of them. In this manner the effects of
multicollinearities are lessened.
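
The two forms of the ridge estimator can be compared on a case small enough
to check by hand. The sketch below assumes two standardized predictors with
correlation r, so that X'X = [[1, r], [r, 1]] has latent roots 1 - r and
1 + r with latent vectors (1, -1)/sqrt(2) and (1, 1)/sqrt(2). The values of
r and X'Y are hypothetical and are not taken from the emissions data.

```python
import math


def ridge_direct(r, a, b, k):
    """Solve (X'X + kI) beta = X'Y for X'X = [[1, r], [r, 1]] and X'Y = (a, b)."""
    det = (1.0 + k) ** 2 - r ** 2
    return (((1.0 + k) * a - r * b) / det, ((1.0 + k) * b - r * a) / det)


def ridge_spectral(r, a, b, k):
    """Same estimator written as sum_j (l_j + k)^-1 c_j V_j with c_j = V_j'X'Y."""
    s = 1.0 / math.sqrt(2.0)
    components = [(1.0 - r, (s, -s)), (1.0 + r, (s, s))]  # (latent root, latent vector)
    beta = [0.0, 0.0]
    for root, v in components:
        c = v[0] * a + v[1] * b      # c_j = V_j' X'Y
        weight = c / (root + k)      # k = 0 gives the least squares weight
        beta[0] += weight * v[0]
        beta[1] += weight * v[1]
    return tuple(beta)


# Hypothetical, highly collinear example: r = .99, X'Y = (.50, .45).
for k in (0.0, 0.01, 0.05):
    print(k, ridge_direct(0.99, 0.50, 0.45, k), ridge_spectral(0.99, 0.50, 0.45, k))
```

In this example the small root 1 - r = .01 dominates the least squares
solution, and even k = .01 halves the weight on its latent vector while
leaving the weight on the other vector nearly unchanged.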

Several important theoretical results are provided by Hoerl and Kennard
(1970a). First, the ridge estimator can be derived by minimizing the
residual sum of squares subject to a fixed value of $\beta'\beta$; i.e., $\hat{\beta}_{RR}$
minimizes

    $\phi = (Y - \beta_0 \mathbf{1} - X\beta)'(Y - \beta_0 \mathbf{1} - X\beta) + \mu(\beta'\beta - r)$.    (3.12)

In the minimization of (3.12), the Lagrangian multiplier $\mu$ becomes k in
$\hat{\beta}_{RR}$. An equivalent derivation of $\hat{\beta}_{RR}$ minimizes the length of the
coefficient vector subject to a fixed increase in the residual sum of squares
over that of least squares; i.e., $\hat{\beta}_{RR}$ minimizes

    $\phi = \beta'\beta + \mu[(\beta - \hat{\beta}_{LS})'X'X(\beta - \hat{\beta}_{LS}) - s]$.    (3.13)

Again, $\hat{\beta}_{RR}$ minimizes eqn. (3.13) with $\mu = k^{-1}$. Both of these properties
of ridge estimators are useful characterizations; moreover, if the researcher
can specify values for r in eqn. (3.12) or s in eqn. (3.13), unique solutions
for k can be obtained.

Often considered the most important - and most controversial - rationale
for ridge regression is the "existence theorem" proven by Hoerl and Kennard
(1970a). The authors show that there always exists a range of values of the
ridge parameter (k) for which $mse(\hat{\beta}_{RR}) < mse(\hat{\beta}_{LS})$, where

    $mse(\hat{\beta}) = E[(\hat{\beta} - \beta)'(\hat{\beta} - \beta)]$.    (3.14)

Two recurring criticisms of the existence theorem are (i) the range on
k depends on the unknown model parameters, and (ii) the criterion (3.14) is
only one of many important criteria for assessing regression estimators (e.g.,
$E[(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)]$ is another). Progress has been made in answering
criticism (ii); e.g., Theobald (1974) generalized the existence theorem to
include any criterion of the form

    $mse(\hat{\beta}) = E[(\hat{\beta} - \beta)'M(\hat{\beta} - \beta)]$,

where M is at least positive semidefinite.

No completely satisfactory response has been forthcoming to the first
criticism. Indeed, further controversy has emanated from a related problem,
procedures for selecting the ridge parameter. Most of the more popular choices
of k rely on stochastic techniques (ridge trace, estimation of the upper bound
suggested by the existence theorem, etc.) and thereby invalidate the
application of the existence theorem. No ridge estimator has yet been devised
which can guarantee a smaller mean squared error than least squares for all
model configurations.

Admission of this last point does not, however, negate the advantages
of ridge regression when predictor variables are multicollinear. Least squares
parameter estimates such as those presented in Table 1 for the interaction
and quadratic fits are clearly unacceptable. The reasons for the patterns
in the signs and magnitudes have been traced directly to the multicollinearities
in X and cannot be attributed to the true functional relationships between
response and predictor variables. While it is true that ridge regression
cannot guarantee a smaller mean squared error than least squares, especially
for a data set as severely multicollinear as this one, it is also true that
least squares cannot guarantee a smaller mean squared error than ridge
regression. Ridge estimates might, however, produce far more reasonable
coefficient estimates with no essential increase in the residual sum of
squares. Let us expand on this statement.

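The residual sum of squares for ridge estimators can be written as

    $SSE_{RR} = (Y - \hat{\beta}_0 \mathbf{1} - X\hat{\beta}_{RR})'(Y - \hat{\beta}_0 \mathbf{1} - X\hat{\beta}_{RR})$
    $\quad = (Y - \hat{\beta}_0 \mathbf{1} - X\hat{\beta}_{LS})'(Y - \hat{\beta}_0 \mathbf{1} - X\hat{\beta}_{LS}) + (\hat{\beta}_{RR} - \hat{\beta}_{LS})'X'X(\hat{\beta}_{RR} - \hat{\beta}_{LS})$
    $\quad = SSE_{LS} + \omega(\hat{\beta}_{RR})$,

where $\omega(\hat{\beta}_{RR})$ is the increase in the residual sum of squares over that of
least squares. An alternative expression for $\omega(\hat{\beta}_{RR})$ is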


    $\omega(\hat{\beta}_{RR}) = \sum_{j=1}^{p} l_j [V_j'(\hat{\beta}_{RR} - \hat{\beta}_{LS})]^2$.    (3.15)


If k is chosen suitably small, eqn. (3.11) shows that only the weights on
latent vectors of X'X corresponding to very small latent roots will be
substantially altered from those of least squares. Those latent vectors,
however, receive the smallest weights in eqn. (3.15). Thus the ridge estimator
can cause $\hat{\beta}_{LS}$ and $\hat{\beta}_{RR}$ to differ greatly in dimensions corresponding to
multicollinearities in X while not appreciably increasing the residual sum of
squares over that of least squares. This is further evidence of the tenuous
nature of least squares estimates when predictor variables are multicollinear:
there can be a number of estimates which are numerically quite different
from least squares but whose residual sums of squares are for all practical
purposes just as small. This notion of directions in estimation space for
which coefficient estimates can change radically without markedly increasing
the residual sum of squares is the same principle which underlies "ridge
analysis" of response surfaces and was one of the first motivations of
ridge regression (see Hoerl (1962)).
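
The size of the increase in the residual sum of squares can be worked out
from the spectral form of the ridge estimator. As a sketch, using the
notation of eqn. (3.11) and assuming only that the $V_j$ are orthonormal with
$X'X V_j = l_j V_j$:

```latex
\hat{\beta}_{LS} - \hat{\beta}_{RR}
  = \sum_{j=1}^{p} \left[ l_j^{-1} - (l_j + k)^{-1} \right] c_j V_j
  = \sum_{j=1}^{p} \frac{k}{l_j (l_j + k)} \, c_j V_j ,
\qquad
(\hat{\beta}_{LS} - \hat{\beta}_{RR})' X'X (\hat{\beta}_{LS} - \hat{\beta}_{RR})
  = \sum_{j=1}^{p} l_j \left[ \frac{k \, c_j}{l_j (l_j + k)} \right]^2
  = k^2 \sum_{j=1}^{p} \frac{c_j^2}{l_j (l_j + k)^2} .
```

Each direction's contribution is its squared least squares weight,
$(c_j / l_j)^2$, multiplied by $l_j [k / (l_j + k)]^2$. The factor $k / (l_j + k)$ is
near one only when $l_j$ is much smaller than k, and in exactly those directions
the multiplier $l_j$ is small.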

Table 6 displays ridge estimates for several choices of the ridge
parameter, k. All of the sets of estimates shown in the table tend to have
magnitudes that are greatly reduced from least squares but which are of
the same order as the least squares fit to only the linear terms. In all
the ridge estimates H, H×P, and $H^2$ have relatively large magnitudes with
H and H×P having negative signs. Some of the temperature variables have
moderate to large magnitudes for small k and some of the pressure variables
have moderate to large ones for the larger k values.

[Insert  Table  6] 

The ridge estimates in Table 6 represent a continuous damping of the
multicollinearities rather than the discrete inclusion or exclusion of each
as with principal component and latent root regressions. The ridge estimates
thereby represent a compromise among various principal component or latent
root estimates. With this data set even the small value of the ridge
parameter, k = .005, results in a ridge estimate which is a compromise between
the deletion of four or five multicollinearities; i.e., even this small k
value drastically reduces the weights associated with the first four or
five latent vectors of X'X. Comparing the ridge estimates for k = .005 with
the principal component (Table 2) and latent root (Table 5) estimates, except
for P and $P^2$ each of the individual ridge estimates lies between the values
of the corresponding principal component or latent root estimates for four
and five deleted components. For example, the ridge estimate for the
coefficient of H, -.733, is between -1.372 and -.561 for $\hat{\beta}_{PC}$ and between
-1.373 and -.509 for $\hat{\beta}_{LR}$. Even the ridge estimated coefficients for P and
$P^2$ are very close to the corresponding ones for the other two biased
estimators.
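
The comparison just described can be checked mechanically against the tables.
The sketch below copies the k = .005 column of Table 6 and the columns for
four and five deleted components from Tables 2 and 5; the variable and
function names are illustrative only.

```python
# Standardized coefficient estimates copied from Tables 2, 5, and 6.
names = ["H", "P", "T", "PxT", "HxP", "HxT", "P^2", "T^2", "H^2"]
ridge_005 = [-.733, .151, -.161, -.227, -.713, .359, .076, .386, .754]
pc_four = [-1.372, .136, -.248, -.530, -1.358, 1.744, .081, .627, .706]
pc_five = [-.561, .068, -.006, -.007, -.600, -.296, .083, .077, 1.047]
lr_four = [-1.373, .119, -.195, -.487, -1.374, 1.761, .093, .527, .708]
lr_five = [-.509, .063, .017, .030, -.553, -.401, .088, .028, 1.049]


def outside(ridge, four, five):
    """Names of coefficients whose ridge estimate is not between the two columns."""
    return [n for n, r, a, b in zip(names, ridge, four, five)
            if not min(a, b) <= r <= max(a, b)]


print(outside(ridge_005, pc_four, pc_five))   # ['P', 'P^2']
print(outside(ridge_005, lr_four, lr_five))   # ['P', 'P^2']
```

Only P and $P^2$ fall outside the bracketing intervals, for both the principal
component and the latent root estimates.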


Table 6.  Ridge Estimates for NOx Emissions Data, Quadratic Fit

Predictor                            Ridge Parameters
Variable            k = 0    k = .005    k = .01    k = .05    k = .10

Humidity (H)       44.738       -.733      -.568      -.262      -.181
Pressure (P)       69.736        .151       .143       .169       .176
Temperature (T)      .734       -.161      -.090      -.016      -.003
P×T                -1.710       -.227      -.128      -.014       .005
H×P               -46.019       -.713      -.567      -.268      -.185
H×T                  .797        .359       .145      -.044      -.057
P^2               -69.005        .076       .107       .164       .174
T^2                  .971        .386       .239       .072       .044
H^2                  .458        .754       .669       .312       .176

R^2                  .835        .789       .778       .738       .719
\hat{\sigma}^2     .00241      .00308     .00324     .00383     .00410

When k is increased to .01, the ridge estimates become closer to the
previous two biased estimates in which five components are deleted. When
k = .05, the ridge estimates whose magnitudes are greater than .10 all lie
between those of $\hat{\beta}_{PC}$ and $\hat{\beta}_{LR}$ for five and six components deleted, as do
all the ridge estimates for k = .10. Numerically these trends reinforce the
theoretically apparent relationship shown in eqn. (3.11): as k increases,
the effects on $\hat{\beta}_{RR}$ of latent vectors corresponding to small latent roots are
lessened.
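
The damping implied by eqn. (3.11) can be tabulated directly. Relative to
least squares, the ridge weight on the jth latent vector is multiplied by
$l_j / (l_j + k)$. The sketch below applies this factor to the six latent roots
reported in Table 2 for the values of k used in Table 6; the function name
`shrinkage` is illustrative.

```python
# Six smallest latent roots of X'X from Table 2 (components 1 to 6).
roots = [0.517e-6, 0.196e-5, 0.109e-4, 0.647e-3, 0.239e-2, 0.239e-1]


def shrinkage(root, k):
    """Ratio of the ridge weight (root + k)^-1 to the least squares weight root^-1."""
    return root / (root + k)


# One row per ridge parameter, one column per latent vector V1 to V6.
for k in (0.005, 0.01, 0.05, 0.10):
    factors = [shrinkage(l, k) for l in roots]
    print(f"k = {k:<5}", "  ".join(f"{f:.4f}" for f in factors))
```

At k = .005 the first three factors are below .003, the fourth is about .11,
the fifth about .32, and the sixth about .83, which matches the description
of this k as a compromise between deleting four and five components.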

Which value of the ridge parameter should be used? This is perhaps the
single most controversial question surrounding the application of ridge
regression. Figure 2 contains a ridge trace, plots of the estimated
coefficients as a function of k, over the range $.0005 \le k \le .10$. For ease of
viewing, only six of the ridge estimates are plotted, the estimates for H, P,
T, H×P, $P^2$, and $H^2$. The first point plotted corresponds to k = .0005. Even
with this
small  a  value  of  the  ridge  parameter  two  of  the  estimates  differ  in  sign  with 
the  least  squares  estimates  (8^  and  8^) .  As  k  increases  8^  also  changes 
sign.  The  magnitudes  of  the  estimates,  moreover,  are  much  smaller  than  the 
least  squares  estimates  over  the  entire  range  of  k  displayed  in  the  figure. 

[Insert  Figure  2] 

Choosing k by the ridge trace method is recognized to be highly
subjective. Following the advice of Hoerl and Kennard (1970a), the trace
appears to have stabilized in the range .005 < k < .01. Either of the first two
sets of ridge estimates in Table 6, or some intermediate choice, could be
viewed as a reasonable choice.

Apart from one's determination of where the trace stabilizes, other
problems affect ridge traces. Figure 3 displays three sets of ridge traces
for two of the coefficient estimates, 0^ and 6g. The three curves for each
estimate differ only with respect to the range of k that is plotted:
.01 < k < 1.0 (p = 1.0), .001 < k < .10 (p = .10), and .0001 < k < .01
(p = .01). Decisions regarding when the traces stabilize appear to be
dependent on which range of k one plots.

[Insert  Figure  3] 

Stochastic and nonstochastic rules for selecting ridge parameter values
have appeared frequently in the statistical literature over the last ten
years (e.g., Dempster, Schatzoff, and Wermuth (1977), Hoerl, Kennard, and
Baldwin (1975), and McDonald and Galarneau (1975)). Numerous as these rules
are, they also tend to suggest different choices of the ridge parameter. For
example, Hoerl, Kennard, and Baldwin (1975) recommended a stochastic estimator
of k which has also performed reasonably well in other simulations (e.g.,
Gibbons (1978), Gunst and Mason (1977a)). For this data set the estimate of
k is

    $\hat{k} = p\hat{\sigma}^2 / \hat{\beta}_{LS}'\hat{\beta}_{LS} = .000002$.

A nonstochastic rule for selecting k which has been advocated is to choose k
so that the largest variance inflation factor is less than some convenient
value, say 10. For the emissions data the resulting value of the ridge
parameter is k = .02. Whether one's preference is to use either of these values
or one from a ridge trace depends on one's familiarity and experience with
the various procedures. For many data sets in which the multicollinearities
are not as numerous or severe as this one, all these techniques can yield
ridge estimates that are similar and the choice of a specific rule is not
critical. For this data set - as with the choice of a specific principal
component or latent root estimate - the choice is important, as the
differences in the estimates in Table 7 attest.
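
The stochastic estimate quoted above can be reproduced from the tabled
values. The sketch below assumes that the inputs are the standardized least
squares estimates and residual variance in the k = 0 column of Table 6, with
p = 9 predictors; the function name is illustrative.

```python
# Least squares (k = 0) standardized estimates and residual variance from Table 6.
beta_ls = [44.738, 69.736, .734, -1.710, -46.019, .797, -69.005, .971, .458]
sigma2_ls = .00241


def hkb_ridge_parameter(beta, sigma2):
    """k-hat = p * sigma^2 / (beta' beta), with p the number of predictors."""
    return len(beta) * sigma2 / sum(b * b for b in beta)


k_hat = hkb_ridge_parameter(beta_ls, sigma2_ls)
print(f"{k_hat:.6f}")   # 0.000002
```

The sum of squared coefficients is roughly 13,750, almost all of it from the
four inflated humidity and pressure terms, so the estimate of k is driven
toward zero.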



[Insert  Table  7] 


Although the foregoing discussion seems to lead to a morass of
confusion, it is intended to stress a point: ridge regression cannot be
mechanically applied without regard to characteristics of the data set being
analyzed. This warning was stressed with reference to biased estimators
in general in Section 2 and with special reference to principal component
and latent root regression earlier in this section, but it cannot be
overemphasized. Thus while the stochastic rule proposed by Hoerl, Kennard, and
Baldwin has performed well in their simulation and others, it is not
recommended for this data set. Since the emissions data is so highly
multicollinear and the least squares estimates are thereby inflated, the
denominator of $\hat{k}$ is much larger than it should be and $\hat{k}$ is extremely close
to zero. The corresponding ridge estimates still have magnitudes and signs
that are inconsistent with the investigators' a priori suspicions.

Use of one of the values suggested by the ridge trace in Figure 2,
k = .005, yields ridge estimates that are more reasonable than the stochastic
estimate of k. Even so, one could question the relative magnitudes of the
non-humidity coefficients and the sign on H×T in light of the suspected
dominance of humidity and the correlations in X'X. Use of the other value
suggested by the ridge trace, k = .01 (see Table 6), or the value suggested
by the maximum variance inflation factor criterion, k = .02, lessens these
last concerns. All three of these ridge estimates (viz., k = .005, .01, .02)
are similar and the choice should be based on external information insofar
as possible. Here, however, the final choice is not as critical as it was
between least squares, the ridge estimates based on $\hat{k}$, and one of these
three.

Table 7.  Ridge Estimates for Several Selection Rules

                          Selection of the Ridge Parameter
Predictor          Stochastic Rule    Ridge Trace    Maximum VIF
Variable             k = .000002        k = .005       k = .02

Humidity (H)              1.058           -.733          -.420
Pressure (P)             23.740            .151           .151
Temperature (T)           1.858           -.161          -.047
P×T                      -3.398           -.227          -.063
H×P                      -3.004           -.713          -.426
H×T                        .592            .359           .020
P^2                     -23.304            .076           .135
T^2                       1.597            .386           .144
H^2                        .963            .754           .528

R^2                        .826            .789           .763
\hat{\sigma}^2           .00255          .00308         .00346


4.  COMMENTS  AND  CONCLUSIONS 


Biased estimation should be viewed as an important alternative to
least squares estimation of the parameters of a multiple linear regression
model. Over the last decade biased estimators have been shown to possess
valuable theoretical and empirical properties which can be especially
advantageous when predictor variables are multicollinear. Although problems
persist with the application of biased estimation, the widespread
popularity of biased estimators attests to the successful implementation of
biased regression methodologies.

Criticisms of biased estimation in regression (e.g. Coniffe and Stone
(1973), Draper and Van Nostrand (1979), Smith and Campbell (1980)) serve at
least two useful purposes. First, they rightfully attack the view that
biased estimation provides a panacea for all the problems inherent in
regression analyses. In particular, they deplore the simplistic view that
one can alleviate the difficulties stemming from multicollinear data sets
by, for example, merely selecting a value of the ridge parameter. Second,
they focus attention on unresolved theoretical questions associated with
biased estimators. Just as Theobald (1974) provided a solution to one such
question, further research can conceivably make progress in answering others.

This paper has attempted to illuminate the benefits and deficiencies
of biased estimation with special reference to the analysis of regression
data. Although criticisms have been leveled at the current incomplete
theoretical justification for biased estimation, attention has been directed
in this paper to the clear advantages of three biased estimators over least
squares on the emissions data. In spite of problems with the application
of the biased estimators (problems which are acknowledged and illustrated),
all three biased estimators result in coefficient estimates which are more
reasonable than those of least squares. The ultimate criteria for measuring
the efficacy of a regression estimator, one can argue, are ones that are
impossible to define precisely and can only be measured subjectively: do
the coefficient estimates make sense from the physical nature of the
problem, and does the final prediction equation predict accurately enough?
Biased regression estimates often satisfy both of these criteria.

One final comment on the emissions data is in order. It was selected
for use in this paper solely because it clearly illustrates the problems
associated with multicollinearities and the effects of the three biased
estimators. As was mentioned in Section 2, there are many alternatives
to least squares with this data. One alternative is to exclude interaction
and/or quadratic terms because they are so highly correlated with the linear
ones. Hamaker (1962) provides an example (in a variable selection context)
of a data set in which an incorrect theoretical model would be fit if such an
alternative were adopted. In addition, all the biased estimators used on the
emissions data suggest retaining H² in spite of its high correlation with
other humidity variables. It is also true that the severe multicollinearities
disappear if the linear terms are standardized prior to the formation of
interaction and quadratic terms. So long as one recognizes that the model
parameters are transformed as well, this is another possible alternative.
Again, the failure to consider these alternatives is based on the use of this
data set to illustrate properties of the various regression estimators.
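
The remark about standardizing the linear terms before forming interaction
and quadratic terms can be checked numerically. The sketch below uses
synthetic predictors whose means are large relative to their spreads; the
variable names, means, and standard deviations are assumptions chosen only
to mimic that situation and are not the emissions measurements. It compares
the correlations between linear and higher-order terms built from the raw
variables with those built from the standardized variables.

```python
import numpy as np


def corr(a, b):
    """Sample correlation between two one-dimensional arrays."""
    return np.corrcoef(a, b)[0, 1]


def standardize(v):
    """Subtract the sample mean and divide by the sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)


# Synthetic predictors (assumed values, NOT the emissions data).
rng = np.random.default_rng(1)
n = 200
H = rng.normal(50.0, 10.0, size=n)   # stand-in for a humidity-like variable
P = rng.normal(29.0, 0.3, size=n)    # stand-in for a pressure-like variable

# Higher-order terms formed from the raw variables.
raw_P_P2 = corr(P, P ** 2)
raw_H_HP = corr(H, H * P)

# Higher-order terms formed after standardizing the linear terms.
h, p = standardize(H), standardize(P)
std_p_p2 = corr(p, p ** 2)
std_h_hp = corr(h, h * p)

print("raw:          corr(P, P^2) = %.4f, corr(H, HxP) = %.4f" % (raw_P_P2, raw_H_HP))
print("standardized: corr(p, p^2) = %.4f, corr(h, hxp) = %.4f" % (std_p_p2, std_h_hp))
```

The coefficients of a model fit to the standardized terms are not the
coefficients of the original polynomial; they must be transformed back if
estimates on the original scale are wanted.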